Use the doc-values skip index to skip per-doc value lookups in LongRangeFacetCutter#16268
Use the doc-values skip index to skip per-doc value lookups in LongRangeFacetCutter#16268slow-J wants to merge 5 commits into
Conversation
03d7d2a to
066c419
Compare
|
I reran benchmarks, this time correctly using localrun, and updated the results in #16268 (comment) |
2e7144b to
0c72d5f
Compare
epotyom
left a comment
There was a problem hiding this comment.
Nice change! One suggestion below
1065433 to
7db2833
Compare
7db2833 to
88fe293
Compare
| } | ||
|
|
||
| /** Single-valued {@link LongValues} for {@link #skipField} in this segment. */ | ||
| final LongValues skipFieldValues(LeafReaderContext context) throws IOException { |
There was a problem hiding this comment.
Is this method a part of the single value unwrapping logic that we want to remove for now?
There was a problem hiding this comment.
Nope, this is part of the core skipper path.
There was a problem hiding this comment.
Oh I see now. The initial logic was:
if single-valued AND has DocValuesSkipper:
use the optimized version
Then in the second revision we tried:
if single-valued:
if has skipper:
use the skipper-optimized version
else:
use the single-valued optimized version
And now we are back to the first approach?
The reason I got confused is that I thought we also used the skipper optimization for multi-valued fields, but I see now that the description explicitly calls out single-valued fields only. I wonder why that is the case? I thought Lucene supported skippers for multi-valued fields as well.
There was a problem hiding this comment.
Just to clarify why I'm asking about multi valued fields: the current version still uses the unwrap-singleton optimization, but only together with the skipper. So there are basically two optimizations here, and I thought that was what you wanted to avoid?
If that’s the case, and if multi-valued fields support skippers, I’d suggest scoping this PR to the skipper optimization only, for both single- and multi-valued fields, and moving the single-valued unwrapping optimization to a follow-up PR. WDYT?
There was a problem hiding this comment.
I think I'm getting confused in the naming here. I have removed the change that was not related to skipper,
The unwrapSingleton call is used to detect the segment is single-valued and to get the single-valued NumericDocValues. The other change, now removed, caused single-valued segments with no skipper to always go to the single-valued cutter when it should depend on the existing logic.
Thanks for this and other comments, I'll look into adding multi-valued skipper when I get the time and do it in this PR.
…range-facets # Conflicts: # lucene/CHANGES.txt
| final LongValuesSource singleValues; | ||
|
|
||
| // Field name whose skip index is used on the single-valued path, or null when faceting a source. | ||
| final String skipField; |
There was a problem hiding this comment.
WDYT about renaming this to fieldName? skip is a bit confusing, since the field does not necessarily have a skipper even if this value is set.
| } | ||
|
|
||
| /** Single-valued {@link LongValues} for {@link #skipField} in this segment. */ | ||
| final LongValues skipFieldValues(LeafReaderContext context) throws IOException { |
There was a problem hiding this comment.
If we rename skipField to fieldName, maybe we should rename this method as well, perhaps to singletonFieldValues?
| /** Single-valued {@link LongValues} for {@link #skipField} in this segment. */ | ||
| final LongValues skipFieldValues(LeafReaderContext context) throws IOException { | ||
| NumericDocValues values = | ||
| DocValues.unwrapSingleton(DocValues.getSortedNumeric(context.reader(), skipField)); |
There was a problem hiding this comment.
Can we deduplicate the calls to DocValues.getSortedNumeric and DocValues.unwrapSingleton? We call them both in maybeSkipper and here for the same field.
Resolves #16249
Implementation heavily inspired by HistogramCollector.java.
Range faceting (in the sandbox module -
LongRangeFacetCutter) currently reads the doc-values value for every matching document and binary-searches it into an elementary interval. When the faceted field is single-valued, we can use a doc-values skip index. For a dense skip block whose min and max values fall into the same elementary interval, every document in that block maps to that interval, allowing us to skip the per-doc value lookup and binary search.Limitation - applies to single-valued, long fields only.
Benchmark (luceneutil)
I used my branch of https://github.com/slow-J/luceneutil/tree/github-16249-range-facet-bench which cherry picked 2 of @epotyom 's commits (mainly mikemccand/luceneutil#582 which adds range-facet support)
Setup:
runlocal.py,wikimediumall(33.3M docs), index-sorted bylastMod_skipperwithaddDVSkippers=true. baseline = main, candidate = this change, both DURING_COLLECTION, sothe only difference is this optimization. 30 JVM iterations.
Command:
python3 -u src/python/localrun.py -s rangeFacetsWikimediumAll -b lucene_baseline -c lucene_candidate -iterations 30 -warmups 20 2>&1 | tee "$BASE/run-timing7.txt"Edit: new benchmark results after the changes for Egors first 2 comments.
Edit2: new benchmark results after unwrapping removed
QPS
Latency (ms) — aggregated across all iterations